ABSTRACT


Supercomputers capable of performing extremely high speed
computation have been proposed which are based on an
architecture known as data flow. A Petri net-based
methodology is applied to evaluate the performance
attainable by such an architecture. The architecture
evaluated is MIT's cell block data flow architecture, which
is being developed to execute the applicative programming
language VAL.

Results show that for the data flow architecture to
achieve its goal of high speed computation, intelligent
multiprogramming schemes need to be developed. One such
scheme, based on the notion of a "concurrency vector", is
introduced.
I.  INTRODUCTION


A.  BACKGROUND 

Despite the orders-of-magnitude increase in computation
speed that has occurred since the early 1950's, the need
still exists today for faster computers. This need is most
critical in the area of scientific computing, where there
exist computations requiring on the order of a billion
floating point operations per second [DENNIS, 1980].

One  approach  to  achieving  higher  computation  speed  is  to 
increase the speed of the basic logic devices of the
computer.  This  approach,  effective  in  the  past,  faces 
significant  obstacles  to  future  gains  because  of  the  speed 
of  light  limitation  to  signal  propagation  and  limitations  in 
the  integrated  circuit  manufacturing  process. 

A second approach to achieving higher computation speed
is through the exploitation of parallelism which is (or can
be) inherent in algorithms used to solve a wide range of
scientific problems. Such parallelism can be present at both
the operation and procedure levels in a program. Thus far,
such exploitation of parallelism has not reached a limiting
threshold to faster computation.


Data flow computing has been proposed as a conceptually
viable method of achieving higher computational speed
through greater exploitation of inherent algorithmic
parallelism. A computer based on the data flow concept
executes an instruction when its operands become available.
No sequential control flow notion exists. Data flow programs
are free of sequencing constraints except those imposed by
the flow of operands between instructions. Thus, a data flow
computer contrasts fundamentally with the "von Neumann"
model. Even so, the data flow concept is capable of
incorporating into one system all the known forms of
parallelism exploitation including vectorization,
pipelining, and multiprocessing.

B.  RESEARCH  APPROACH 

It was the purpose of this research to gain insights
into the degree of parallelism exploitation obtainable with
the data flow-based high speed computation method. The
classical issues of hardware utilization, program execution
time, and "degree" of multiprogramming were investigated in
the context of data flow. Application of an existing Petri
net-based methodology was the technique used to gain these
insights.

The hypothesis for this research had two parts. First,
the suitability of the Petri net-based Requester-Server
methodology for prediction of the performance of data flow
machines in an efficient, accurate manner was to be
explored. Second, a challenge to the data flow concept was
made. It was hypothesized that the goal of achieving higher
speed computation through data flow computing is
unattainable without achieving a high and "intelligent"
degree of multiprogramming. By "intelligent" it is meant
that the mapping of processes onto the hardware shall have
to be done in a near optimal fashion, defined in terms of
hardware utilization and program execution time.

To explore the two-part hypothesis, sets of Petri net
models of data flow programs, characterized by a range of
inherent parallelism, were "executed" on Petri net models of
the Dennis-Misunas data flow hardware design [DENNIS, AUG
1974], using the methodology called Requester-Server [COX,
1978]. The hardware models were varied in the number of
processing elements available for use in "executing" the
sets of program models. Thus software models were "run" on
hardware models, and appropriate performance indices were
measured and analyzed.

This research is important because it suggests a method
for mapping data flow programs onto the data flow machine to
achieve the desired degree of high speed computation.
Additionally, the Requester-Server (R-S) methodology has
been shown to be an effective tool for predicting the
performance of data flow computer architectures.

Grateful acknowledgment is made to L. A. Cox, designer
and initial implementer of the R-S methodology, and to D. M.
Stowers, who modified the R-S software to enable it to run
on the PDP-11/50 minicomputer at NPS.
C.  ORGANIZATION


The results reported here are organized in a fashion
conducive to the communication of experimental computer
science endeavors. Following a review of the applicable
literature (Section II), the hypothesis (Section III) is
presented. Next, the method used to test the hypothesis is
presented in detail (Section IV). This section discusses the
experimental design, which includes the identification of
independent and dependent variables, characterizes and
explains the Petri net definition of the data flow hardware
and software, and ends with an account of the procedure used
to implement the experiment to test the hypothesis. Results
of the experiment and a discussion thereof are covered in
Section V. This section, in addition to demonstrating the
suitability of the R-S technique and exploring the
multiprogramming response of data flow architectures,
presents some unexpected findings. Section VI summarizes the
entire research effort, including the results. Finally,
Section VII presents recommendations for further
investigation in the area of data flow research.


II.  LITERATURE  REVIEW 


A.  APPROACHES  TO  PARALLELISM 

In general, computer science literature approaches the
concept of parallelism exploitation from either an
architecture (hardware) or language (software) point of
view. In this thesis, "parallelism" shall be viewed as
existing at many hierarchical levels within algorithms. Any
of several different computer architectures may be capable
of exploiting the parallelism which exists at one or more of
these various hierarchical levels. As should be expected,
each architecture is best suited to exploiting inherent
algorithmic parallelism at a particular hierarchical level,
but not at others. In contrast, implementation of the data
flow concept proposes to exploit inherent algorithmic
parallelism at all hierarchical levels, in an efficient
fashion. Before presenting the concept of data flow, a
review of the range of architectures and strategies
currently used to exploit parallelism is presented.

In an early study of high speed computer architectures
[FLYNN, 1966] a four element taxonomy was developed which
classified computer systems in terms of the amount of
parallelism in their instruction streams and data streams
(see figure II.A.1). (An instruction stream is the series of
operations used by the processor; a data stream is the
series of operands used by the processor.) The first element
of this taxonomy, depicted in figure II.A.1(a), is the
serial computer which executes one instruction at a time,
affecting at most one data item at a time. Such a serial
machine is denoted as a single-instruction
single-data-stream (SISD) computer. The SISD computer can be
characterized as possessing no capability for exploitation
of algorithmic parallelism. The three remaining computer
system organizations within the Flynn taxonomy do possess
capabilities for exploiting algorithmic parallelism.

By allowing more than one data stream, a
single-instruction multiple-data-stream (SIMD) computer
results, as shown in figure II.A.1(b). This organization
allows vectorization and is known as a vector or array
processor because each instruction operates on a data vector
during each instruction cycle, rather than on just one
operand. The model in figure II.A.1(b) shows N processors
each accepting as input its own data stream. It is
noteworthy that each of the N processors is not a standalone
serial machine (SISD computer) because the N processors take
the same instruction from an external control unit at each
time step.

If the SISD computer is extended to permit more than one
instruction stream, the multiple-instruction
single-data-stream (MISD) computer shown in figure II.A.1(c)
results. This computer system organization within Flynn's
taxonomy is present for completeness but has yet to be shown
to possess much utility. An example of such a machine would
be one built to generate tables of functions (such as
squares and square roots) of a stream of numbers. Each
processor would perform a different function on the same
data item at each time step.

The fourth and final element of Flynn's taxonomy is one
which possesses parallelism in both the instruction and data
streams. This multiple-instruction multiple-data-stream
(MIMD) computer (shown in figure II.A.1(d)) is made up of N
complete SISD machines which are interconnected for
communication purposes. Such a parallel architecture is more
readily recognized as a multiprocessor in which as many as N
processors can be performing useful work at the same time.
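
The four classes just described are distinguished only by whether
each kind of stream is single or multiple. A minimal sketch of that
classification follows; the function name flynn_class and its
integer arguments are illustrative assumptions, not part of the
taxonomy as published.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn label (SISD, SIMD, MISD or MIMD).

    Hypothetical helper: a stream count of 1 maps to "S" (single),
    any larger count maps to "M" (multiple).
    """
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


if __name__ == "__main__":
    for counts in [(1, 1), (1, 8), (4, 1), (4, 4)]:
        print(counts, flynn_class(*counts))
```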

Beyond Flynn's taxonomy are other approaches to
parallelism. The first, pipelining, is a strategy which
makes use of the fact that a processor, in executing an
instruction, actually performs a sequence of functions in
various functional units of the processor. Each function is
performed at a different stage along the pipeline. Figure
II.A.2 shows a processor with a simple pipeline design.
Rather than waiting for each instruction to be completely
executed before beginning the next instruction, the pipeline
processor begins execution of the next instruction as soon
as functional units at the beginning of the pipeline are
available. Thus, the pipeline is normally full, containing
more than one instruction in various stages of execution.
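
The benefit of keeping the pipeline full can be illustrated with a
small timing sketch. It rests on idealized assumptions that are not
stated in the text: every stage takes exactly one time step, every
instruction passes through every stage, and no stalls occur. The
function names are hypothetical.

```python
def serial_time(n_instructions: int, n_stages: int) -> int:
    """Time steps when each instruction finishes before the next starts."""
    return n_instructions * n_stages


def pipelined_time(n_instructions: int, n_stages: int) -> int:
    """Time steps for an ideal pipeline with no stalls.

    The first instruction needs n_stages steps to fill the pipeline;
    each later instruction completes one step after its predecessor.
    """
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


if __name__ == "__main__":
    n, k = 10, 4
    print(serial_time(n, k), pipelined_time(n, k))
```

Under these assumptions the speedup approaches the number of stages
as the instruction count grows, which is why a full pipeline matters.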


The  final  approach  to  parallelism  to  be  presented  is  the 
strategy  of  overlapping.  In  the  traditional  sense, 
overlapping  within  a  computer  system  occurs  when  the  central 
processing  unit  (CPU)  is  allowed  to  function  concurrently 
with  input/output  (I/O)  operations.  Such  concurrency  was 
prevented  in  early  computers  because  I/O  operations  required 
data  paths  to  memory  which  ran  through  CPU  registers, 
preventing  CPU  functions  from  occurring  while  performing 
I/O.  Overlapping  can  occur  in  other  ways  within  a  computer 
but  the  example  given  is  sufficient  to  convey  the  general 
idea. 

The techniques for exploiting algorithmic parallelism
that have been presented are not all mutually exclusive. For
example, the strategies of pipelining and overlapping can be
included in any of the four architectures. Furthermore,
other more complex machine organizations have been proposed.
One example is the multiple SIMD (MSIMD) machine which
consists of more than one control unit sharing a pool of
processors through a switching network [HWANG, 1979]. Such
hybrids will not be considered further.

Having described the major architectural approaches to
exploiting algorithmic parallelism, it is appropriate to
characterize the problems for which each method is suitable
and to present some of the difficulties that still exist in
using each method. By "suitable" it is meant that the method
allows the processing of a problem in such a manner that
some speedup in execution time is achieved in comparison
with what the execution time would be for the problem run on
a serial machine.

The main implementation of the SIMD architecture, the
array processor, is suitable for computations which can be
described by vector instructions. Also, operands processed
simultaneously must be capable of being fetched
simultaneously from memory. Finally, processor
interconnections must support high speed data routing
between processors. If any of the above conditions is not
met, then the computation may execute in a predominantly
serial fashion within this SIMD computer. Because of these
required conditions, the array processor is generally
considered to be a specialized, rather than general
purpose, machine.

As  previously  mentioned,  the  MISD  architecture  exists 
merely to complete the Flynn taxonomy and will not be
discussed  further  [STONE,  1980] . 

The dominant MIMD computer is the multiprocessor. The
multiprocessor is considered to be a general purpose
machine. Accordingly, many problems should be well-suited
for execution on such an architecture. Despite the fact that
such systems have been shown to work well in a number of
applications, especially those which consist of a number of
concurrently-processable subproblems with minimal data
sharing, numerous questions have yet to be answered. These
questions include how to best organize the parallel
computations (i.e. partition the problem) to optimize the
use of the cooperating processors, how to synchronize the
processors in the system, and how to best share the data
among system processors. Also, problems which possess an
iterative structure can run efficiently on an array
processor and avoid the overhead of synchronization and
scheduling required of the multiprocessor [STONE, 1980].

Although not considered "architectures" in the sense of
Flynn's taxonomy, both pipelining and overlapping (of which
there exist different types) are general purpose strategies
that can be applied to most problems. Like the
multiprocessor, these techniques also permit the
partitioning of a problem so that several operating hardware
pieces can function concurrently. Accordingly, pipelining
and overlapping are often considered to be forms of
multiprocessing. The difference lies in the fact that
pipelining and overlapping perform partitioning at different
hierarchical levels of a problem than does the
multiprocessing technique.

Armed with an understanding of the diverse architectural
approaches to parallelism exploitation that have been used
to date, it is logical to proceed with an alternative
approach, that of data flow. Before doing so, however, Petri
nets will be introduced. This is appropriate because the
concepts of data flow computation are a direct application
of Petri net theory.


B.  PETRI  NETS 


Petri net theory plays an important part in this
research endeavor for two reasons. First, as has been
mentioned, Petri net theory forms the basis for the concepts
used to describe and define data flow computation. Second,
Petri net theory is the basis for the Requester-Server
methodology that is used in this thesis research as a
computer performance prediction tool. Because of its
applicability, Petri net theory shall be presented herein
in an informal manner, with emphasis placed on its use in
modelling parallel computation. Those desiring a more formal
and complete discussion of Petri nets are referred to
[PETERSON, 1977].

Petri nets may be thought of as formal, abstract models
of information flow. Their main use has been in the
modelling of systems of events in which some events may
occur concurrently but there exist constraints on the
frequency, precedence and concurrence of these events. A
Petri net graph models the static structure of a system. The
dynamic properties of a system can be represented by
"executing" the Petri net in response to the flow of
information (or occurrence of events) in the system.
The static graph of a Petri net is made up of two types
of nodes: circles (called places) which represent
conditions, and bars (called transitions) which represent
events. These nodes are connected by directed arcs running
from either places to transitions or transitions to places.
The source of a directed arc is the input, and the terminal
node is the output. The position of information in a net is
represented by markers called tokens.

The dynamic execution of a Petri net is controlled by
the position and movement of the tokens. A token moves as a
result of a transition firing. In order for a transition to
fire, it must be enabled. A transition is enabled when all
of the places which are inputs to the transition are marked
with a token. Upon transition firing, a token is removed
from each of the input places and a token is placed on each
of the output places of the transition. Thus, in modelling
the dynamic behavior of a system, the occurrence of an event
is represented by the firing of the corresponding
transition.
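
The enabling and firing rules just stated can be sketched in a few
lines. This is a minimal illustration, not the Requester-Server
software: the class name PetriNet, its method names, and the small
example net are all assumptions, and each place is assumed to appear
at most once among the inputs of a transition.

```python
from collections import Counter


class PetriNet:
    """Minimal place/transition net (illustrative sketch only)."""

    def __init__(self, transitions, marking):
        # transitions: name -> (list of input places, list of output places)
        self.transitions = transitions
        # marking: place -> number of tokens; unmarked places count as 0
        self.marking = Counter(marking)

    def enabled(self, name):
        """A transition is enabled when every input place holds a token."""
        inputs, _ = self.transitions[name]
        return all(self.marking[place] >= 1 for place in inputs)

    def fire(self, name):
        """Remove one token per input place, add one per output place."""
        if not self.enabled(name):
            raise ValueError(f"transition {name} is not enabled")
        inputs, outputs = self.transitions[name]
        for place in inputs:
            self.marking[place] -= 1
        for place in outputs:
            self.marking[place] += 1


if __name__ == "__main__":
    # Hypothetical net: transition t consumes from p1 and p2, produces on p3.
    net = PetriNet({"t": (["p1", "p2"], ["p3"])}, {"p1": 1, "p2": 1})
    net.fire("t")
    print(dict(net.marking))
```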

Figures II.B.1 through II.B.4 show a Petri net at
progressive stages of execution. As can be observed, the
status of the execution at a given time can be described by
the distribution of the tokens in the net. This distribution
of tokens in a Petri net is called the net marking and
uniquely defines the state of the net for any given instant.
Petri nets are uninterpreted models. Thus, some
significance must be attached to token movement to indicate
the intent of the model. This is usually done by labelling
the nodes of a net to correspond in some way to the system
being modelled. However, it should be remembered that the
labelling of the nodes of a Petri net in no way affects its
execution. A second attribute of Petri nets is their ability
to model a system hierarchically. An entire net may be
replaced by a single node (place or transition) for
modelling at a greater level of abstraction or, conversely,
a single node may be replaced by a subnet to show greater
detail in the model.

Petri nets, as a formal graph model, are especially
useful in modelling the flow of information and control in
systems which can be characterized by asynchronous and
concurrent behavior. Figure II.B.5 shows the initial marking
of a Petri net model of such a system. Initially, transition
E1 is enabled because each of its input places, C1 and C2,
is marked with a token. Firing transition E1 removes one
token each from places C1 and C2, and puts a token into each
output place, C3 and C4. At this point in the net execution,
transition E3 is disabled because one of its input places,
C5, still has no token. Transition E2, however, is enabled,
and upon firing causes a token to be removed from place C3
and one deposited in place C5. As an aside, this portion of
the model could correspond to a system sequencing
constraint, that of event E3 having to wait until event E2
completes. Upon firing transition E3, places C6 and C7
become marked with tokens as places C4 and C5 lose a token
each. Transitions E4 and E5 are now enabled and can fire
simultaneously, the occurrence of which corresponds to
concurrent events in a modelled system. Doing so, that is,
firing transitions E4 and E5, brings the Petri net model
back to its original (initial) configuration.
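
The walkthrough above can be replayed step by step. In the sketch
below the arcs of E1, E2 and E3 are taken from the text; the arcs of
E4 and E5 (C6 to C1 and C7 to C2) are an assumption, chosen only
because they return the net to its initial marking as the text
requires. The figure itself is not reproduced here, so the real net
may pair these places differently.

```python
# name -> (input places, output places)
NET = {
    "E1": ({"C1", "C2"}, {"C3", "C4"}),
    "E2": ({"C3"}, {"C5"}),
    "E3": ({"C4", "C5"}, {"C6", "C7"}),
    "E4": ({"C6"}, {"C1"}),  # assumed arcs, see lead-in
    "E5": ({"C7"}, {"C2"}),  # assumed arcs, see lead-in
}


def enabled(marking, name):
    """True when every input place of the transition holds a token."""
    return all(marking.get(place, 0) > 0 for place in NET[name][0])


def fire(marking, name):
    """Return the new marking after firing; empty places are dropped."""
    if not enabled(marking, name):
        raise ValueError(f"transition {name} is not enabled")
    inputs, outputs = NET[name]
    new = dict(marking)
    for place in inputs:
        new[place] = new.get(place, 0) - 1
    for place in outputs:
        new[place] = new.get(place, 0) + 1
    return {place: count for place, count in new.items() if count > 0}


if __name__ == "__main__":
    marking = {"C1": 1, "C2": 1}
    for name in ["E1", "E2", "E3", "E4", "E5"]:
        marking = fire(marking, name)
        print(name, sorted(marking))
```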

One other situation that can be represented using Petri
nets is that of conflict. Figure II.B.6 shows a net model of
such a situation. Simply, transitions E1 and E2 are both
enabled. However, if either transition fires, the remaining
transition becomes disabled. In such a case, it is an
arbitrary decision as to which one fires. Because we would
like to be able to duplicate experiments and obtain the same
results, a scheme that is often used involves simply
assigning priorities to transitions which are subject to
conflict in a net. In this way reproducible results can be
ensured. If true nondeterminism is desired in such a model,
a scheme in which probabilities are associated with each
transition can effectively model nondeterminism in the
system under study.
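
Both conflict-resolution schemes can be sketched briefly. The net
below is a hypothetical stand-in for figure II.B.6 in which E1 and E2
compete for the single token in a shared input place C1; the place
names, the convention that a lower number means higher priority, and
the function names are all assumptions made for illustration.

```python
import random

# Hypothetical conflict net: E1 and E2 share the single input place C1.
TRANSITIONS = {
    "E1": ({"C1"}, {"C2"}),
    "E2": ({"C1"}, {"C3"}),
}


def enabled(marking, name):
    inputs, _ = TRANSITIONS[name]
    return all(marking.get(place, 0) > 0 for place in inputs)


def fire(marking, name):
    inputs, outputs = TRANSITIONS[name]
    new = dict(marking)
    for place in inputs:
        new[place] = new.get(place, 0) - 1
    for place in outputs:
        new[place] = new.get(place, 0) + 1
    return new


def pick_by_priority(marking, priority):
    """Deterministic choice: lowest priority number wins (assumed rule)."""
    candidates = [t for t in TRANSITIONS if enabled(marking, t)]
    if not candidates:
        return None
    return min(candidates, key=lambda t: priority[t])


def pick_by_probability(marking, weights, rng):
    """Weighted random choice among the enabled transitions."""
    candidates = [t for t in TRANSITIONS if enabled(marking, t)]
    if not candidates:
        return None
    return rng.choices(candidates, weights=[weights[t] for t in candidates], k=1)[0]


if __name__ == "__main__":
    marking = {"C1": 1}
    print(pick_by_priority(marking, {"E1": 0, "E2": 1}))
    print(pick_by_probability(marking, {"E1": 0.5, "E2": 0.5}, random.Random(0)))
```

Seeding the random generator, as in the usage lines, keeps even the
probabilistic scheme repeatable from run to run.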

Thus, to properly model a system with Petri nets, every
sequence of events in the modelled system should be possible
in the Petri net and every sequence of events in the Petri
net should represent a possible sequence in the modelled
system.

Figure II.B.1: MARKED PETRI NET, TIME=0

Figure II.B.2: MARKED PETRI NET, TIME=1

Figure II.B.3: MARKED PETRI NET, TIME=2

This section has introduced Petri nets and demonstrated
their usefulness in formally modelling information and
control flow in systems characterized by asynchronous and
concurrent behavior. Readers interested in the use of Petri
nets for performance evaluation of such systems are referred
to [RAMAMOORTHY, 1980] and [RAMCHANDANI, 1974]. The
following section shall introduce readers to the concept of
data flow which, as mentioned previously, is a direct
application of Petri net theory.

C.  CONCEPT  OF  DATA  FLOW 

Data flow computing is a method of multiprocessing which
proposes to exploit inherent algorithmic parallelism at all
hierarchical levels within a program. Additional objectives
include effectively using the capabilities of LSI technology
and simplifying the programming task. The concept of
computation under data flow was derived by Dennis [DENNIS,
1974] (and a number of others working independently),
predominantly from Karp and Miller's [KARP, 1966] work on
computation graphs. This section begins by presenting the
data flow concept from the perspective of language, rather
than that of architecture. This approach is appropriate in
view of the fact that data flow computer systems are being
designed as hardware interpreters for a base language that
is fundamentally different from conventional languages. A
hardware description of the Dennis-Misunas data flow
architecture design completes this section.

In a data flow computer, an instruction is executed as
soon  as  its  operands  become  available.  No  notion  of  separate 
control  flow  exists  because  the  data  dependencies  define  the 
flow  of  control  in  a  data  flow  program.  In  fact,  data  flow 
computers  have  no  need  for  a  program  location  counter. 

This contrasts with the traditional "von Neumann"
computer architecture model which uses a global memory whose
state is altered by the sequential execution of
instructions. Such a model is limited by a "bottleneck"
between the computer control unit and the global memory
[BACKUS, 1978]. The global memory "feature" allows
conventional languages to have side-effects, a common
example of which is the ability of a procedure to modify
variables in the calling program. Such side-effects are
prohibited under the data flow concept. Furthermore, in data
flow, no variables exist, nor are there any scope or
substitution rules. In fact, the data flow concept prohibits
the modification of anything that has a value. Rather, data
flow computing takes inputs (operands) and generates outputs
(results) that have not previously been defined. Thus,
instructions in data flow are pure functions. This is
necessary so that instruction execution can be based solely
on the availability of data (operands). Thus the data
dependencies must be equivalent to, and in fact define, the
sequencing constraints in a program. Also, to exploit
parallelism at all levels, it must be possible to derive
these data dependencies from the high level language program
instructions [ACKERMAN, 1979].

A language which allows processing by means of operators
applied to values is called an applicative language. VAL
(Value-oriented Algorithmic Language) is a high level data
flow applicative language under development at MIT [ACKERMAN
and DENNIS, 1979]. It prevents any side-effects by requiring
programmers to write expressions and functions; statements
and subroutines are not allowed in the language. Because of
this constraint, most concurrency is apparent in a high
level language program written in VAL. For the purposes of
this research, no further understanding of the high level
language of data flow is required. Information about high
level language alternatives is available in [McGRAW, 1980],
[ACKERMAN and DENNIS, 1979], and [ACKERMAN, 1979].

At what would correspond to the assembly language level,
a data flow computation can be represented as a graph. The
nodes of the graph correspond to operators and the arcs
represent data paths. An arc into a node represents an input
operand path; an arc leaving a node corresponds to a result
path. Data flow graph execution occurs as operands become
available at each node. When the input arcs of a node each
have a value on them, the node can execute by removing those
values, computing the operation, and placing the results on
the output arcs (see figure II.C.1). The example data flow
graph in figure II.C.2 computes the mean and standard
deviation of its three input parameters.

    function Stats (X,Y,Z: real returns real, real)
      let
        Mean real := (X + Y + Z) / 3;
        SD real := SQRT( (X^2 + Y^2 + Z^2) / 3 - Mean^2 )
      in
        Mean, SD
      endlet
    endfun

Figure II.C.2: A SIMPLE STATISTICS
FUNCTION AND ITS DATA FLOW GRAPH
[McGRAW, 1980]
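
The data-driven firing rule described above can be sketched with a
small interpreter that fires every node whose input arcs all carry
values. The node names and the grouping of operators below are
assumptions made for illustration and are not copied from the graph
in figure II.C.2; only the computed quantities (mean and standard
deviation of three inputs) follow the text.

```python
import math

# node -> (operator, names of the arcs feeding it); hypothetical layout
GRAPH = {
    "sum": (lambda x, y, z: x + y + z, ["X", "Y", "Z"]),
    "sumsq": (lambda x, y, z: x * x + y * y + z * z, ["X", "Y", "Z"]),
    "Mean": (lambda s: s / 3, ["sum"]),
    "meansq": (lambda q: q / 3, ["sumsq"]),
    "Mean2": (lambda m: m * m, ["Mean"]),
    "var": (lambda a, b: a - b, ["meansq", "Mean2"]),
    "SD": (math.sqrt, ["var"]),
}


def run(graph, inputs):
    """Fire nodes as their operands become available.

    Returns the final values and the list of firing steps; nodes in
    the same step had no data dependency on one another and could
    have executed concurrently.
    """
    values = dict(inputs)
    pending = set(graph)
    steps = []
    while pending:
        ready = sorted(
            node for node in pending
            if all(arc in values for arc in graph[node][1])
        )
        if not ready:
            raise ValueError("no node can fire: missing operands")
        results = {
            node: graph[node][0](*(values[arc] for arc in graph[node][1]))
            for node in ready
        }
        values.update(results)
        pending -= set(ready)
        steps.append(ready)
    return values, steps


if __name__ == "__main__":
    values, steps = run(GRAPH, {"X": 1.0, "Y": 2.0, "Z": 3.0})
    print(values["Mean"], values["SD"])
    print(steps)
```

No program counter appears anywhere in the sketch; the order of
execution falls out of operand availability alone.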

Such a graph notation is useful in illustrating the
various levels of parallelism in a program. For example, a
graph node may represent a simple operator such as addition,
or the entire statistics function of figure II.C.2. Thus the
data flow graph notation can represent parallelism existing
at the operator, function and even computation level. The
graphs execute asynchronously, nodes firing when data inputs
are available. Thus no synchronization problem exists with
regard to accessing shared data. Each data flow path can be
marked with a value by only one operator node. Once a value
is on a path, no operator can modify that value. The value
can only be read when used as an input to another node
[McGRAW, 1980].

Again, the data flow graph notation merely allows a
logical representation of a program at a level corresponding
to conventional assembly language. This logical
representation shall now be extended to permit the reader to
understand the basic data flow hardware instruction
execution mechanism. A simple example computation that shall
be used to facilitate reader understanding is the following:

Z = (X+Y)*(X-Y)

The graph representation of this computation is shown in
figure II.C.3.

Figure II.C.4: AN ACTIVITY TEMPLATE FOR
THE ADDITION OPERATOR

Figure II.C.5: PROGRAM GRAPH USING ACTIVITY
TEMPLATES FOR THE DATA FLOW PROGRAM GRAPH OF
Figure II.C.3 [DENNIS, 1980]

In the extended graph representation scheme, a data flow
program exists as a collection of activity templates, each
template corresponding to a node in the data flow program
graph. For example, figure II.C.4 shows an activity template
for the addition operator. There are four fields in the
activity template. The first field denotes the operation
code which specifies the operation to be performed. The
second and third fields are receivers, which are locations
waiting to receive operand values. The fourth field is a
destination field which specifies where the result of the
operation on the operands is to go. There can be multiple
destination fields. Figure II.C.5 shows the program graph
representation of figure II.C.3, using activity templates.

Activity templates have been developed which control the
routing of data for such program structures as conditionals
and iterations. These templates are mentioned to point out
the fact that graph nodes can represent not only simple
operators but also more elegant and necessary constructs.

Some definitions which are necessary to the
understanding of the data flow instruction execution
mechanism follow. First, a data flow program instruction is
the fixed portion of an activity template and is made up of
the opcode and the destinations.


instruction:

<opcode, destinations>

Each destination field provides the address of some activity
template and an input (or offset) denoting which receiver of
the template is the target.

destination:

<address, input>

Data flow program execution occurs as follows. The
fields of a template which has been activated (by the
arrival of an operand value at each receiver) form an

operation packet:

<opcode, operands, destinations>

When the operation packet has been operated upon, a result
packet of the form

result packet:

<value, destination>

is generated for each destination field of the original
activity template. Result packet generation triggers the
placement of the value in the receiver designated by its
destination field. Thus, at a logical level, data flow
program execution occurs as a consequence of operation
packet and result packet movements through a machine
described in detail below.
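
The template and packet forms defined above can be made concrete
with a short sketch. The Python below is illustrative only and is
not taken from the Dennis-Misunas design: the class names, the use
of small integer addresses, and the operand values 5 and 3 are
assumptions introduced here. It builds activity templates for the
example computation, read as Z = (X+Y)*(X-Y), and forms one
operation packet and its result packets.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Destination:
    """<address, input>: target template and which receiver to fill."""
    address: int
    input: int


@dataclass
class ActivityTemplate:
    """Opcode and destinations are fixed; receivers await operand values."""
    opcode: str
    destinations: Tuple[Destination, ...]
    receivers: List[Optional[float]] = field(
        default_factory=lambda: [None, None])

    def enabled(self) -> bool:
        # A template is activated once every receiver holds a value.
        return all(r is not None for r in self.receivers)

    def operation_packet(self):
        # <opcode, operands, destinations>
        return (self.opcode, tuple(self.receivers), self.destinations)


def result_packets(value, destinations):
    # One <value, destination> packet per destination field.
    return [(value, d) for d in destinations]


# Hypothetical activity store: 0 = ADD, 1 = SUB, 2 = MUL.
store = {
    0: ActivityTemplate("ADD", (Destination(2, 0),)),
    1: ActivityTemplate("SUB", (Destination(2, 1),)),
    2: ActivityTemplate("MUL", ()),
}

# Operand values arrive at both receivers of the ADD template.
store[0].receivers = [5.0, 3.0]
op_packet = store[0].operation_packet()
results = result_packets(5.0 + 3.0, store[0].destinations)
```

Here `op_packet` carries the opcode, both operands, and the single
destination of the ADD template, and `results` holds one result
packet addressed to receiver 0 of the MUL template.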

The basic data flow instruction execution mechanism is
shown in figure II.C.6. The data flow program, consisting of
a collection of activity templates, is held in the activity
store (see figure II.C.6). Each activity template is
uniquely addressable within the activity store. When an
instruction is ready to be executed (i.e., the template is
enabled), its address is entered in the instruction queue
unit (established as a FIFO buffer).

The fetch unit is then responsible for: removing, one at
a time, instruction addresses from the instruction queue,
fetching the corresponding activity template, forming an
operation packet based on the field values in the template,
and submitting the operation packet to an operation unit for
processing. The operation unit processes the operation
packet by: performing the operation specified by the opcode
on the operands, forming result packets (one for each
destination field of the operation packet), and transmitting
the result packets to the update unit. The update unit fills
in the receivers of activity templates (designated by the
destination fields in the result packets) with the
appropriate values. The update unit is also responsible for
checking the target template to see if it has all receivers
filled, thus enabling the template. If so, the address of
the enabled template is added at the end of the instruction
queue by the update unit.
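
The division of labor among the instruction queue, fetch unit,
operation unit, and update unit can be sketched as a simple loop.
This is a minimal sequential sketch under stated assumptions, not
the hardware mechanism itself: the dictionary layout, the opcode
names, and the convention that a template with no destination
yields a program output are all introduced here for illustration.
The loop handles one packet at a time, so it does not model the
overlap of units that the real mechanism permits.

```python
from collections import deque
import operator

# Assumed opcode table; the source does not enumerate opcodes.
OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


def make_store(x, y):
    """Activity store for Z = (X+Y)*(X-Y); addresses are arbitrary."""
    return {
        0: {"opcode": "ADD", "receivers": [x, y], "destinations": [(2, 0)]},
        1: {"opcode": "SUB", "receivers": [x, y], "destinations": [(2, 1)]},
        2: {"opcode": "MUL", "receivers": [None, None], "destinations": []},
    }


def run(store, enabled):
    """Move packets through fetch, operation, and update until idle."""
    queue = deque(enabled)   # instruction queue, first in first out
    outputs = []             # values from templates with no destination
    while queue:
        # Fetch unit: take an address, read the template, form the packet.
        template = store[queue.popleft()]
        opcode = template["opcode"]
        operands = tuple(template["receivers"])
        destinations = list(template["destinations"])

        # Operation unit: apply the opcode, form one result per destination.
        value = OPS[opcode](*operands)
        result_packets = [(value, dest) for dest in destinations]
        if not destinations:
            outputs.append(value)

        # Update unit: fill receivers and enqueue newly enabled templates.
        for result_value, (target, receiver) in result_packets:
            receivers = store[target]["receivers"]
            receivers[receiver] = result_value
            if all(r is not None for r in receivers):
                queue.append(target)
    return outputs


z = run(make_store(5, 3), enabled=[0, 1])
print(z)  # [16]
```

The MUL template is enqueued only after both the ADD and SUB
results have arrived, regardless of the order in which the two
enabled templates are fetched.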

At this point it is appropriate to discuss how and where
program parallelism can be exploited by this hardware.

"...once the fetch unit has sent an operation packet off
to the operation unit, it may immediately read another
entry from the instruction queue without waiting for the
instruction previously fetched to be completely processed.
Thus a continuous stream of operation packets may flow
from the fetch unit to the operation unit so long as the
instruction queue is not empty.

"This mechanism is aptly called a circular pipeline --
activity controlled by the flow of information packets
traverses the ring of units leftwise. A number of packets
may be flowing simultaneously in different parts of the
ring on behalf of different instructions in concurrent
execution. Thus the ring operates as a pipeline system
with all of its units actively processing packets at once.
The degree of concurrency possible is limited by the
number of units on the ring and the degree of pipelining
within each unit. Additional concurrency may be exploited
by splitting any unit in the ring into several units which
can be allocated to concurrent activities." [DENNIS,
NOV 1980]

The Dennis-Misunas data flow architecture for
implementing the described instruction execution mechanism
is called the cell block architecture and is illustrated in
figure II.C.7.

"The heart of this architecture is a large set of
instruction cells, each of which holds one activity
template of a data flow program. Result packets arrive at
instruction cells from the distribution network. Each
instruction cell sends an operation packet to the
arbitration network when all operands and signals have
been received. The function of the operation section is to
execute instructions and to forward result packets to
target instructions by way of the distribution network."
[DENNIS, NOV 1980]


Figure II.C.9 reflects a practical form of the cell
block architecture which makes use of LSI technology and
reduces the number of devices and interconnections. This
practical form is obtainable by grouping the instruction
cells of figure II.C.7 into blocks, each of which is a
single device. In this organization, several cell blocks are
serviced by a group of multifunction processing elements.
The arbitration network channels operation packets from cell
blocks to processing elements. Rather than employing a set
of processing elements each capable of a different function,
which is one design option, use of one multipurpose
processing element type is the favored approach. Such an
approach precludes the need for the arbitration network to
route operation packets according to opcode. Instead, it
simply has to forward operation packets to any available
processing element. It is this design which forms the basis
for the system model used in this research effort.

How does the basic mechanism relate to the cell block
architecture? Figure II.C.9 shows a cell block
implementation. It differs from the basic mechanism in two
ways. First, the cell block has no processing element(s)
(operation unit(s)). Second, result packets targeted for
activity templates held in the same cell block must traverse
the distribution network before being handled by the update
unit [DENNIS, NOV 1980]. This is the Dennis-Misunas data
flow architecture design. Other designs do exist; for
examples, see [GOSTELOW, 1980] and [WATSON, 1979].


Figure II.C.6: DATA FLOW INSTRUCTION EXECUTION
MECHANISM [DENNIS, 1980]


Figure II.C.9: A SIMPLE CELL BLOCK IMPLEMENTATION
[DENNIS, 1980]

D.  COMPUTER  PERFORMANCE  PREDICTION 

Computer performance prediction is an evaluation process
which proposes to estimate the performance of a system not
yet in existence (i.e., in some state of design).
"Performance" simply means how well a system works. This in
turn connotes the concept of value. So, the purpose of
estimating the performance of a system under design is to
determine that system's expected value.

In order to quantify how well a system works or shall
work, performance metrics called indices are used. Typical
indices and their definitions are:

THROUGHPUT RATE - The volume of information processed
    by a system in one unit of time.

HARDWARE UTILIZATION - The ratio between the time the
    hardware is used during an interval of time, and the
    duration of that interval of time.

RESPONSE TIME - The elapsed time between the submission
    of a program job to a system and completion of the
    corresponding job output.
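
The three indices reduce to simple ratios and differences. The
sketch below restates the definitions in Python; the function
names and the sample numbers are assumptions chosen for
illustration and are not results of this research. For a machine
with several processing elements, the utilization example treats
"time the hardware is used" as busy PE-steps out of the total
PE-steps available, which is one reasonable reading of the
definition.

```python
def throughput_rate(units_processed, interval):
    """Volume of information processed per unit of time."""
    return units_processed / interval


def hardware_utilization(busy_time, interval):
    """Fraction of an interval during which the hardware is in use."""
    return busy_time / interval


def response_time(submitted_at, completed_at):
    """Elapsed time from job submission to completion of its output."""
    return completed_at - submitted_at


# Hypothetical run: 4 PEs observed for 10 time steps, 30 PE-steps busy,
# 30 instructions completed, one job submitted at step 0 and done at 10.
num_pes, steps, busy_pe_steps = 4, 10, 30
utilization = hardware_utilization(busy_pe_steps, num_pes * steps)
throughput = throughput_rate(30, steps)
elapsed = response_time(0, 10)
```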

Computer performance prediction can be achieved via
several different techniques. Each technique has limitations
and advantages. The technique utilized in this thesis is
that of simulation. The simulation technique involves the
representation, by a model, of certain aspects of the
behavior of a system in the time domain. Observing these
aspects of the behavior in time of the system's model, under
inputs generated by a model of the system's inputs, produces
results useful in the evaluation of the modelled system
[FERRARI, 1979]. For the purposes of this research, the
aspects of behavior that are of interest are the performance
indices previously defined.

Of significant importance to any simulation effort are
the  issues  of  validation  and  parameter  estimation. 
Conceptually,  validation  attempts  to  establish  some  degree 
of  confidence  that  the  simulation  shall  produce  results 
which  shall  closely  correspond  with  the  performance  of  the 
system  under  scrutiny.  Parameter  estimation  provides  the 
simulation  effort  with  hopefully  credible  parameter  values 
needed  to  perform  a  simulation  having  relevant  results. 
These  issues  shall  be  addressed  in  section  IV. A: 
Experimental  Design. 

The last section of the review of the literature
applicable to this research endeavor presents the
Requester-Server methodology. The Requester-Server
methodology is the "tool" used to perform the simulation
which generates the results on which the prediction of data
flow performance is based.

Readers desiring a more thorough presentation of the
subject of computer performance prediction are referred to
[FERRARI, 1979], [COX, 1978], [ALLEN, 1980], [HAMMING,
1975], [SPRAGINS, 1980], [BOZEN, 1980], and [SAUER, 1980].

E. REQUESTER-SERVER METHODOLOGY

The Requester-Server (R-S) methodology was designed and
initially implemented by L. A. Cox, Jr. [COX, 1978].
Subsequently, the Requester-Server software was modified by
D. M. Stowers [STOWERS, 1979] to run on the PDP-11/50
minicomputer at NPS. This section summarizes those portions
of [COX, 1978] and [STOWERS, 1979] which are applicable to
and necessary for the understanding of this research.

The R-S methodology is capable of predicting the
performance of computer systems characterized by
asynchronous, concurrent behavior. The methodology can
predict performance at both the computer system and computer
job levels. The R-S methodology allows the user to
separately specify the hardware configuration(s) to be
evaluated, the software (programs) to be used in evaluating
the hardware configuration(s), and the mechanism or policy
for allocating hardware resources to program requests for
service. The methodology makes provision for variable levels
of detail (in a hierarchical sense) in both the hardware and
software. Finally, the R-S methodology is capable of
simulating concurrency in both the hardware and software.
Thus, for a given hardware configuration, the control
structure mandated by the software can be mapped onto the
hardware and system performance analyzed and predicted.

The simulation process is begun by representing the
software (programs) and hardware as two separate Petri net
graphs. In the Petri net graph of the software, each arc can
be thought of as having an associated propagation delay, the
extent of which is dependent upon the hardware configuration
used to execute the program. If these delays are definable
by their correlation to the Petri net model of the hardware,
then performance values for the indices of section II.D
(Computer Performance Prediction) can be obtained by
executing the Petri net model of the software on the Petri
net hardware configuration(s). The R-S "tool" serves as the
interface between the Petri net model of system software and
the Petri net model of system hardware. This interface
permits the hardware and software Petri net graphs to be
constructed separately. This is important because the
control structure and sequencing constraints of both
hardware and software can be maintained separately. This
permits a direct and meaningful representation of both the
system software and hardware being modelled.

The source file which serves as the input to the R-S
program is organized into three sections. The software
section of the input file consists of a description of the
Petri net graph representing the software program(s) to be
executed. This net graph description is formulated in terms
of the functions and constraints of the services required of
the hardware. The hardware section of the input file is made
up of a description of the computer system components and
their interconnections. This description can be
(hierarchically) at a bit level or major component level,
depending on the system aspects under scrutiny. The Petri
net graph upon which the hardware description is based is
constructed in terms of its operation in time. The last
section of the input file, called the dynamic section,
provides the user of the R-S "tool" a place to denote system
initial conditions by defining the hardware and software
nets' token markings at the beginning of a "run". As may be
recalled from section II.B (Petri Nets), both the software
and hardware sections merely define static Petri net
structures. Performance prediction follows from the
attachment of significance to the structures and
restrictions on token movement within these structures.

The dynamic nature of Petri nets is exploited by this
R-S methodology as follows. The software net representation
makes a series of requests for the services of the hardware
net representation. Repeatedly, the R-S process maps these
requests for service onto the hardware net representation.
At each "invocation" the R-S process "runs" the hardware net
to provide the service requested by the software net. Upon
completion of each of the service requests, the R-S process
runs the software net representation until the hardware is
again needed. This cycle repeats itself until the software
net representation has been completely "run" and its
terminal state reached.




Events in the hardware net graph correspond to
operations in time. A collection of events is used to
represent each functional unit. Token movement through the
hardware net graph corresponds to the flow of data and
control through the modelled hardware system. A simple
hardware net description is provided in figure II.E.1.
Events in the software net graph correspond to requests for
service. As an example, an event could equate to a request
for a floating point multiplication. The flow of tokens in
the software net graph equates to the logical flow of the
algorithm, constrained by its implicit data dependencies or
sequencing constraints. A simple software net description is
provided in figure II.E.2.

Together, the software and hardware net graphs can be
executed in such a way as to simulate the operation of the
computer system for the given software workload. The
interaction of the two net graphs is orchestrated by the R-S
token arbiter. Network simulation begins with the marking of
the "BEGIN" node of the software net graph. This net graph
is then executed as would be any Petri net graph. The
arrival of a token at any place in the software net graph
indicates a request for service, at which time the R-S token
arbiter takes control. (The type of service requested is
denoted by the type of the place and is defined in the
software net description.) The R-S token arbiter removes the
token from the software net and then permits the software
net graph to continue executing until no further moves are
possible. The R-S token arbiter then initializes the
hardware functional unit (net graph denoted by the type of
service requested) by marking it with tokens. The hardware
net graph is then executed one step. Tokens reaching events
corresponding to service completion are removed, and the
token of the software net which originally caused the
request for service is replaced, by the R-S token arbiter.
Repeating this sequence of actions results in the execution
of the software net graph by the hardware net graph. A
sample input file dynamic section and the results obtained
from executing the software and hardware net graph
descriptions of figures II.E.1 and II.E.2 are presented in
figure II.E.3. Those readers interested in the
Requester-Server methodology are referred to [COX, 1978] and
[STOWERS, 1979] for a more in-depth discussion of its
capabilities and usage.
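
The request-and-service cycle described above can be imitated in
a few lines. The sketch below is not the R-S program and does not
parse its input language; it is a simplified stand-in under
stated assumptions. The function name, the single "gate" token
count, and the service time of three hardware cycles per request
are assumptions, the last inferred from the completion times
shown in the sample listing of figure II.E.3 rather than stated
in the text.

```python
from collections import deque


def simulate(requests, service_time=3, gate_tokens=1, max_cycles=10):
    """Serve software requests on gated hardware; return completion times."""
    pending = deque(requests)   # software events awaiting hardware service
    in_service = {}             # event -> hardware cycles still required
    completions = {}            # event -> time at which service completed
    time = 0
    while time < max_cycles and (pending or in_service):
        # Arbiter: start a request whenever the gate has a free token.
        while pending and len(in_service) < gate_tokens:
            in_service[pending.popleft()] = service_time
        # Execute the hardware one step.
        time += 1
        for event in list(in_service):
            in_service[event] -= 1
            if in_service[event] == 0:
                # Service complete: return control to the software event.
                del in_service[event]
                completions[event] = time
    return completions


trace = simulate(["J+K", "M+J"])
print(trace)  # {'J+K': 3, 'M+J': 6}
```

With the gate marked with one token, the second request waits for
the first, giving completion times of 3 and 6. Raising
`gate_tokens` to 2 in this sketch lets both requests finish at
time 3.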

This completes the necessary review of the literature
required to understand the research that follows. The
fundamental concepts of the various approaches to
parallelism, Petri nets, the data flow architecture,
computer performance prediction, and the Requester-Server
methodology have been reviewed. The next section presents
the two-part hypothesis which this research addresses and
tests.


Figure II.E.2: A SAMPLE SOFTWARE NET GRAPH AND
INPUT FILE DESCRIPTION


BEGIN DYNAMIC NET;

MARK GATE WITH 1;

COMMENT: GATE ENABLED TO ALLOW
         ONLY ONE OPERATION IN
         PROGRESS AT ANY TIME.

EXECUTE 10;

COMMENT: EXECUTE TEN HW CYCLES
         OR UNTIL PROGRAM IS
         COMPLETE.

END DYNAMIC NET;


S5: EXECUTE 10;

*PROGRAM EVENT J+K REQUESTS HW SVCS(1)
*PROGRAM EVENT M+J REQUESTS HW SVCS(1)

TIME = 1:
TIME = 2:
TIME = 3:

*PROGRAM EVENT J+K COMPLETES(3)

TIME = 4:
TIME = 5:
TIME = 6:

*PROGRAM EVENT M+J COMPLETES(6)

S6: END DYNAMIC NET;


Figure II.E.3: A SAMPLE INPUT FILE DYNAMIC SECTION
AND OUTPUT FILE LISTING


III. HYPOTHESIS


Because there exist several data flow architecture
proposals,  it  is  desirable  to  have  a  tool  with  which  to 
predict  the  performance  of  the  diverse  designs  for 
comparison  purposes.  The  first  part  of  this  research's 
hypothesis  was  that  the  Petri  net-based  Requester-Server 
(R-S)  methodology  is  such  a  tool,  capable  of  predicting  the 
performance  of  data  flow  architectures  in  an  efficient, 
accurate  manner.  In  effect,  the  R-S  tool  was  to  be  tested. 

The second part of this research's hypothesis was
concerned with the Dennis-Misunas data flow architecture
design. This design was chosen for two reasons. First, there
existed adequate information in the literature about this
design on which to base an accurate model for simulation
purposes. Second, the Dennis-Misunas design of the basic
instruction execution mechanism is essentially the same as
several other schemes in various stages of implementation
[DENNIS, 1979]. The hypothetical challenge to this design
was that the goal of achieving higher speed computation is
not attainable unless a high and "intelligent" degree of
multiprogramming is realized, as shall be explained next.

Obviously, high speed computation shall require a high
hardware utilization. By this it is meant that most of the
processing elements (PE's) shall have to be performing
useful work most of the time. Such a high hardware
utilization is attainable when either of two situations
occurs. First, a high hardware utilization will result when
a process possessing a large amount of inherent parallelism
is being run (by itself) on the machine. In this case, a
program's execution time is dependent upon its amount of
inherent parallelism and the number of PE's in the machine.
Second, a high hardware utilization is attainable when a
multiprogramming environment (in which several processes are
permitted to simultaneously run on the machine) is
instituted. In such a multiprogramming environment, an
individual process shall be competing for hardware resources
(PE's). Thus, that process' execution time may be lengthy
regardless of its amount of inherent parallelism. This is so
because that process may have the use of only a small
portion of the machine's resources (PE's) at any point in
time. Put another way, if at any time a process has N
instructions available for execution, but there are fewer
than N PE's available for executing those instructions in a
parallel fashion, then the process' execution time shall be
lengthened over what it could be if it had sufficient PE's.
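
The lengthening effect can be illustrated with a small
calculation. The sketch below is a simplification introduced
here, not the model used in this research: it assumes every
instruction takes one time step, that a program is described by
a hypothetical list giving the number of instructions ready at
each logical step when PE's are unlimited, and that no
instruction of a later step may start until the current step has
finished.

```python
from math import ceil


def execution_steps(ready_counts, num_pes):
    """Time steps needed when at most num_pes instructions run per step."""
    return sum(ceil(n / num_pes) for n in ready_counts)


# Hypothetical program: 8, then 4, then 2, then 1 instructions ready.
profile = [8, 4, 2, 1]
steps_by_pes = {p: execution_steps(profile, p) for p in (1, 2, 4, 8)}
print(steps_by_pes)  # {1: 15, 2: 8, 4: 5, 8: 4}
```

With eight PE's the program finishes in four steps, one per
logical step. With two PE's the first logical step alone takes
four time steps, and the whole program takes eight.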


In such a situation, a scheme may be needed to implement
a policy which achieves two objectives:

1. maintaining high hardware utilization and

2. providing an acceptable average response time for
a user requiring a given amount of processing.

"Acceptable average response time" is construed to mean that
the actual response time of any particular program which
requires a given amount of processing shall not be
lengthened considerably over what it would be if the program
were executed by itself on the data flow machine. Thus, it
shall be desirable to minimize the effect of system load on
an individual program's execution time. That the second
objective should be met even at the expense of the first
objective is a strong point made by [KLEINROCK, 1976]. By
merely mapping processes onto the data flow machine as they
arrive, it is expected that objective #1 shall be achieved
but at the expense of objective #2. This situation was
expected to be demonstrated by this research.
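
The tension between the two objectives can be seen in a toy
model. The sketch below is an illustration under assumptions
made here and does not reproduce the experiments of this
research: each program is a hypothetical list of instruction
counts per logical step, every instruction takes one time step,
a logical step must finish before the next begins, and free PE's
are handed to programs greedily in arrival order.

```python
def run_mix(profiles, num_pes):
    """Return (finish time of each program, overall PE utilization)."""
    remaining = [list(p) for p in profiles]   # work left per logical step
    stage = [0] * len(profiles)               # current logical step
    finish = [None] * len(profiles)
    busy = 0                                  # PE-steps spent on useful work
    time = 0
    while any(f is None for f in finish):
        free = num_pes
        for i, prof in enumerate(remaining):
            if finish[i] is not None:
                continue
            # Give this program as many free PEs as it can use right now.
            take = min(prof[stage[i]], free)
            prof[stage[i]] -= take
            free -= take
            if prof[stage[i]] == 0:
                stage[i] += 1
                if stage[i] == len(prof):
                    finish[i] = time + 1
        busy += num_pes - free
        time += 1
    return finish, busy / (time * num_pes)


serial = [1, 1, 1, 1]    # little inherent parallelism
parallel = [4, 4]        # highly parallel
alone = run_mix([parallel], 4)          # ([2], 1.0)
mixed = run_mix([serial, parallel], 4)  # ([4, 4], 0.75)
```

In this toy case the serial program alone would keep only one of
four PE's busy. Adding the parallel program raises utilization
to 0.75, but the parallel program's response time doubles from 2
to 4 steps relative to running by itself.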

The purpose of this section has been to "frame" the
research area by presenting the issues which give rise to
the hypothesis. The following section presents the method
used to test the hypothesis, and includes a discussion of
the assumptions made to facilitate the simulation
experiments, where undecided design issues remain.


IV. METHOD


A.  EXPERIMENTAL  DESIGN 

The experiment to allow prediction of data flow computer
performance involved executing sets of Petri net models of
data flow programs on Petri net models of data flow
hardware. The Requester-Server (R-S) program tool monitored
the data flow (model) programs' "execution" and provided
data which permitted the determination of the performance
indices: response time and hardware utilization. It is
important to realize that the results of this research
predict the performance of a model of a data flow computer,
not that of an operating data flow machine itself. (Model
validation and parameter estimation issues are addressed in
section IV.B: Data Flow Hardware Definition.)

The reader who is familiar with analytic modelling
employing queueing theory may ask why that technique, rather
than the simulation technique, is not used to predict the
performance of the data flow design. The answer is that the
analytic approach unnecessarily constrains the prediction by
requiring assumptions to be made about the software.
Specifically, the Petri net models of software programs are
discretely defined with regard to the amount of inherent
parallelism available for exploitation at each time step in
program execution. To model analytically, the variability of
inherent parallelism available for exploitation must be
described by probability distributions which hide the
definable nature of the programs at discrete time steps.

For this experiment, the data flow architecture was to
be modelled with several different quantities of processing
elements (PE's). The sample data flow program models were to
be characterized by varying but definable amounts of
inherent parallelism available for exploitation. Each
(model) program was to be separately run on each (model)
hardware configuration. (Hereafter, the word "model" shall
be omitted but assumed in referring to the program and
hardware models used in this experiment.) Data was to be
obtained to permit determination of the performance indices
(response time and hardware utilization), for each run, from
the monitor function of the R-S tool. After running each
program separately, arbitrary program mixes were to be run
on each hardware configuration and the same performance
indices again determined. Finally, hand-optimized program
mixes were to be run on each hardware configuration and the
same performance indices determined once again. By
evaluating the results, the hypothesis was expected to be
either supported or refuted.

The independent variable for this experiment was defined
to be the quantity of PE's available to the hardware model.
Because PE's are but one resource demanded by a process in
execution, other independent variable choices could have
included other resources such as: the quantity of cell
blocks available, the type of distribution or arbitration
network employed, and/or the type of PE's (multipurpose or
sets of single-purpose functional units) utilized. Expanding
the number of independent variables increases significantly
the complexity of evaluating the results. How these issues
were resolved is explained in section IV.B.

The dependent variables for this experiment were the
parameters response time and hardware utilization. The
results of the experiment were expected to provide data
which could be plotted on graphs. Curves plotting the
execution time of each data flow program against the number
of PE's would constitute one such graph. Others, and their
significance, are presented in section V: Results and
Discussion.

B.  DATA  FLOW  HARDWARE  DEFINITION 

The Petri net models of the data flow hardware
configurations were quantified in terms of their operation
in time. Such quantification required assigning time
duration values to each portion of the cell block
architecture model in such a way as to closely model the
hardware. Doing so required several assumptions to be made.
Those assumptions shall be addressed individually so as to
help substantiate the credibility of the resultant hardware
models.

To begin, the processing elements (PE's) were assumed to
be multifunctional, each capable of executing any
instruction routed to it in one "standard" instruction
execution time unit. Allowing the PE's to be multifunctional
and characterized by a singular execution time simplifies
the modelling process.

There were at least two other possibilities that could
be accommodated by expansion of the Petri net models. First,
each multifunction PE could be replaced by a set of
single-purpose PE's, each single-purpose PE defined in terms
of its particular instruction execution time, and capable of
executing concurrently with other PE's of the set. Second,
each PE could be replaced by a subnet in which only one
instruction could be executed in any given time step, but
the model would define the execution time as a function of
the instruction type.

The first alternate approach implies a more complex
arbitration network with a conceivably longer routing time.
The second alternative would require additional net
complexity. (However, this approach would be a good
possibility for subsequent research.) Because the actual
implementation configuration has not been finalized,
modelling the PE's as multifunctional and characterized by a
singular execution time was a reasonable path to follow.

The distribution network design also has not been
finalized. For ease of modelling purposes, a crossbar
switch design capable of supporting simultaneous transfers
of result packets to cell blocks was chosen to be modelled.
This choice permitted a standard routing time to be
characterized by the model. Other network designs,
especially packet routing networks, may be preferred to the
crossbar switch for the ultimate machine because of their
lower cost and comparable performance in a data flow
architecture [DENNIS, 1979].

The choice to model the PE's as multifunction units
precluded the need for anything but a simple arbitration
network. Such a network would merely have to route operation
packets to any available PE. Accordingly, in the model, a
standard routing time for this network was characterized.

With regard to the cell blocks, the assumption was made
that sufficient cell blocks were available to hold all
portions of all processes being run on the machine at each
and every instant. Thus, there is no notion of paging
portions of processes into and out of memory (the activity
store in the case of the data flow architecture). This
assumption carries with it the assumption that all program
compilation (resulting in extended data flow graph-like
representations) is complete before beginning program
execution. Other compilation strategies are under
consideration, such as requiring the user to interact with
the system to achieve a high degree of parallelism
exploitation [McGRAW, 1980].

As has been described, other hardware choices
(representing independent variables in an experiment) can be
made and easily implemented by simply defining appropriate
subnets which, in a time-wise fashion, characterize the
portions of the hardware under scrutiny. The approach taken
in this research permitted the hardware timing
characteristics to be a function of simply the number of
PE's. Figure IV.B.1 is the Petri net representation (of the
cell block architecture hardware) utilized in this research.
For the purposes of this experiment and in the configuration
described, each PE was assumed to be driven at the rate of
two million floating point operations per second (FLOPS), a
rate claimed to be reasonable by [DENNIS, 1980]. This figure
represents an instruction execution time of 500 nanoseconds
(nsec). (This is represented in the hardware model by
signifying a scaling of each event/transition pair to equal
100 nsec.) Associating timing characteristics with each
component in the data flow architecture design results in a
similar figure (550 nsec), as shown in figure IV.B.2.
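
The timing arithmetic in this paragraph can be checked with a short sketch. It uses only the drive rate quoted above and the component times listed in figure IV.B.2; the constant and function names are illustrative and are not part of the R-S tool.

```python
# Values taken from the text and from figure IV.B.2; names are illustrative.
PE_RATE_FLOPS = 2_000_000  # assumed drive rate of each PE


def instruction_time_nsec(rate_flops):
    """Time per floating point operation, in nanoseconds."""
    return 1e9 / rate_flops


COMPONENT_NSEC = {
    "PE (instruction execution)": 50,
    "cell block (memory fetch)": 250,
    "distribution network (crossbar)": 250,
    "arbitration network": 0,  # assumed negligible in the figure
}

total_nsec = sum(COMPONENT_NSEC.values())
print(instruction_time_nsec(PE_RATE_FLOPS), total_nsec)
```

The component total of 550 nsec is of the same order as the 500 nsec implied by the two million FLOPS rate, which is the sense in which the two figures are "similar".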

PE (instruction execution)                           50 nsec
CELL BLOCK (memory fetch assuming MOS technology)   250 nsec
DISTRIBUTION NETWORK (assuming crossbar switch)     250 nsec
ARBITRATION NETWORK (assuming negligible)
    [WEITZMAN, 1980]
                                            TOTAL:  550 nsec

FIGURE IV.B.2: TIMING CHARACTERISTICS OF DATA FLOW
ARCHITECTURE COMPONENTS


For the purposes of this research, the quantity of PE's in
the hardware was varied from one to sixteen by successive
doublings. This resulted in data generated for the following
quantities of modelled PE's: 1, 2, 4, 8, and 16.

This summarizes the assumptions utilized in developing
the model presented here. The following section describes
the Petri net definition of the data flow software program
models which were "executed" on the hardware models.

C.  DATA  FLOW  SOFTWARE  DEFINITION 

The Petri net models of data flow programs were
quantified in terms of the amount of inherent parallelism
available for exploitation at each discrete time step as
well as in terms of the implicit data dependencies of the
programs. (As previously mentioned, the data dependencies
define the control flow of a program.) The initial approach
involved taking sample programs written in the high level
language (hll) VAL and converting them to their equivalent
Petri net representations for subsequent "execution" on the
data flow hardware models. The problem with this approach
was that the compilation process is not yet developed. Thus,
the hardware instructions that would be required for each
hll instruction were not determinable.

The subsequent approach, which was utilized, involved
designing Petri net program models characterized by various
but discretely definable levels of inherent parallelism.
Executing such artificial programs conceivably produced more
informative results than would have been obtained with a few
select programs which may have only demonstrated data flow's
suitability for those special purpose computations. The
individual programs shall be characterized after introducing
a new concept.

A new concept introduced at this point is that of a
software "concurrency vector". A concurrency vector is a
tuple, each entry of which defines the amount of inherent
parallelism in a program, at the operation packet
hierarchical level, at a discrete instruction execution time
step. Each entry of the tuple is implicitly subscripted by
the time step it describes. For example, the simple
statistics function of section II.C (see figure II.C.2, page
32) would be characterized by the concurrency vector
(4,2,2,2,1,1). In this example concurrency vector, the "4"
represents the fact that four operations, among them "SQ",
"SQ", and "SQ", could be processed in parallel during the
first time step of execution of the simple statistics
function. This is so because no sequencing constraints exist
among these four operations. Thus the concurrency vector
defines how many operation packets could be processed in
parallel if all the instructions (i.e. functions: addition,
subtraction, division, square, square root) were implemented
in hardware. (If they were not, the subfunctional operation
packets required by the instruction would be considered in
defining the concurrency vector entries.) It should also be
recognized that the concurrency vector, though a function of
a program, is dependent upon a standard instruction
execution time duration. If the hardware is implemented such
that execution time is a function of the instruction type,
then the concurrency vector entries could be described at an
even lower level than the operation packet level. Such a
level would correspond to a basic hardware cycle time, where
executing an instruction would require more than one
hardware cycle to complete. This additional complexity need
not be considered in this research in view of the hardware
design approach taken, but could be accommodated by the R-S
methodology used here.
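
A minimal sketch of how a concurrency vector might be derived from a program's data dependencies follows. It assumes every operation takes one time step and fires as soon as all of its predecessors have completed. The dependency graph shown is a hypothetical reconstruction of a mean and standard deviation computation over three inputs, chosen because it yields the vector quoted above; it is not copied from figure II.C.2, and all names are illustrative.

```python
from collections import Counter


def concurrency_vector(deps):
    """Count the operations that become executable at each time step.

    deps maps an operation name to the list of operations whose
    results it needs. Each operation is assumed to take one step.
    """
    level = {}

    def step_of(op):
        # An operation fires one step after its latest predecessor.
        if op not in level:
            level[op] = 1 + max((step_of(p) for p in deps[op]), default=0)
        return level[op]

    for op in deps:
        step_of(op)
    counts = Counter(level.values())
    if not counts:
        return ()
    return tuple(counts[t] for t in range(1, max(counts) + 1))


# Hypothetical graph: mean and standard deviation of x, y, z.
STATS_DEPS = {
    "add_xy": [], "sq_x": [], "sq_y": [], "sq_z": [],
    "add_xyz": ["add_xy"], "add_sq_xy": ["sq_x", "sq_y"],
    "mean": ["add_xyz"], "add_sq_xyz": ["add_sq_xy", "sq_z"],
    "sq_mean": ["mean"], "mean_sq": ["add_sq_xyz"],
    "variance": ["mean_sq", "sq_mean"],
    "std_dev": ["variance"],
}

print(concurrency_vector(STATS_DEPS))
```

Under these assumptions the first entry counts the operations with no predecessors, which is why no sequencing constraints exist among them.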

Four programs ("A" through "D") were utilized in this
research. These programs are differentiable by their length
as well as by the amount of inherent parallelism available
for exploitation at each time step. The Petri net
representations of these programs are shown in figures
IV.C.1 through IV.C.4. Additionally, the concurrency vector
for each is shown. The program mixes for this experiment
included one of each of the three programs "A", "B" and
"D". Another program mix including one of each of the four
programs, "A" through "D", was also used.

The following section presents the procedure utilized in
executing the experiment. Additionally, the method of
mapping the program mixes onto each of the hardware
configurations is explained.


FIGURE IV.C.2: (MODEL) PROGRAM "B"
CV: (8,8,8,8)


D.  PROCEDURE/IMPLEMENTATION 

The four procedural steps utilized in executing this
experiment were as follows. First, Petri net models of both
the hardware configurations (figure IV.B.1) and software
programs (figures IV.C.1 through IV.C.4) were converted to a
format acceptable as input to the Requester-Server (R-S)
program. Two Pascal programs, compiled and executed on the
NPS "B-side" PDP-11 (a UNIX-based system), facilitated the
(separate) generation of the hardware and software portions
of the input files for the R-S program. Each input file was
formed by concatenating the hardware and software portions
and then editing the resulting file to define the dynamic
execution desired. The second step was to transfer each
complete input file from the NPS "B-side" to the NPS
"A-side" PDP-11 (an RSX-11M-based system), via an
inter-processor link. Thirdly, the R-S program was run on
the "A-side", taking as input the file which defined the
hardware, software, and dynamic execution desired. Fourth
and finally, data regarding the execution of the software on
the hardware was obtained from the output file generated by
the R-S program. The results of the data analysis are
presented and discussed in section V (Results and
Discussion).

The implementation portion of this section addresses the
technique used to map the software onto the hardware in such
a way as to effectively simulate this function as it might
be done on a real data flow machine. The first set of
experimental "runs", which consisted of the separate running
of each program on each hardware configuration, was
straightforward in implementation. The procedure described
above, in which each program file portion was concatenated
with the appropriate hardware file portion, achieved a
relevant mapping for modelling a single process running on a
particular hardware configuration. The subsequent set of
experimental "runs", in which a program mix was mapped onto
each of the hardware configurations, was not so
straightforward to implement, as shall be explained next.

To understand the mapping of software (i.e. processes)
onto data flow hardware, it is helpful to scrutinize the
functions of the operating system for such a machine.
Because the scheduling and synchronization of concurrent
activities are built in at the hardware level, a data flow
machine's operating system will only be responsible for
initialization, termination, and input/output (I/O) of
processes. Once a process is mapped onto the data flow
machine, it runs to completion without further intervention
by the operating system (except for I/O). The question which
must be answered is: When should another process be mapped
onto a machine which is already executing one or more
processes? Thus, in defining the input file of a program mix
representative of ready processes, the program mix had to be
defined in terms of a mapping function.

The mapping functions can be thought of as operating
system assignment policies. Thus, for those runs involving
program mixes (as opposed to single programs), an assignment
policy had to be simulated. One aspect of this research then
can be viewed as an investigation of different policies for
mapping processes onto data flow machines in a
multiprogramming environment.

Each program mix consisted of the three programs "A",
"B" and "D". (A later "run" for which data was gathered
utilized the four program mix consisting of one each of the
programs "A" through "D".) Each mix was varied in the way in
which it was mapped onto the hardware, simulating different
operating system mapping functions. The operating system
assignment policies for mapping a program mix onto the
hardware configurations follow. Three policies were
simulated. First, the three programs were permitted to begin
"execution" at the same time. Second, an "80% Rule" was
simulated in which an additional program was permitted to
begin "execution" whenever the hardware utilization dropped
below 80%. Third, an "intelligent" assignment policy was
implemented via a mapping function based on the programs'
concurrency vectors. This assignment policy, it was
envisioned, would cause optimal performance in terms of the
performance indices: response time and hardware utilization.
The concurrency vector approach optimizes the assignment of
processes onto the machine by fitting together concurrency
vectors of ready processes in such a way that the objectives
noted in section III are achieved. For example, given a
machine with eight PE's, the concurrency vectors would be
fitted as shown in figure IV.D.1.


JOB "A"                4  3  2  1  2  3  4
JOB "B"                   3  5  7  6  5  2
JOB "C"          4  4  4  2  1
JOB "D"   ... 8  4  4
JOB "E"                                  2  7  8 ...

TOTAL:    ... 8  8  8  8  8  8  8  8  8  8  7  8 ...

          =====time=====>


FIGURE IV.D.1: AN EXAMPLE OF "FITTING" CONCURRENCY
VECTORS TOGETHER
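
The fit in figure IV.D.1 can be checked by summing the vectors column by column. The sketch below is illustrative only: the start offsets are not stated in the figure and are inferred here so that the column sums reproduce the TOTAL row, and only the visible tail of job "D" and the visible head of job "E" are used. The function name is hypothetical.

```python
def total_load(placements):
    """Sum concurrency vectors column by column.

    placements is a list of (start_step, vector) pairs.
    """
    length = max(start + len(cv) for start, cv in placements)
    load = [0] * length
    for start, cv in placements:
        for i, c in enumerate(cv):
            load[start + i] += c
    return load


# Offsets are assumptions inferred from the TOTAL row of the figure.
FIGURE_FIT = [
    (0, (8, 4, 4)),              # visible tail of job "D"
    (1, (4, 4, 4, 2, 1)),        # job "C"
    (3, (4, 3, 2, 1, 2, 3, 4)),  # job "A"
    (4, (3, 5, 7, 6, 5, 2)),     # job "B"
    (9, (2, 7, 8)),              # visible head of job "E"
]

print(total_load(FIGURE_FIT))
```

With these offsets, every column but one sums to exactly eight, the number of PE's in the example machine.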


By generating the concurrency vectors at compile time, the
program can declare beforehand those resources (as a
function of time) needed for execution as well as when the
program will be completed. An operating system can thus
choose the sequencing of the waiting processes that gives
the best fit and so best meets the objectives of section
III.
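
One way an operating system might use compile-time concurrency vectors is sketched below. This is a simple first-fit heuristic chosen for illustration, not an algorithm taken from this research: each ready job, in queue order, is started at the earliest time step at which its whole vector fits under the PE count. The function names and the job queue are hypothetical.

```python
def earliest_start(load, cv, n_pe):
    """Earliest step at which cv fits on top of load without exceeding n_pe."""
    start = 0
    while True:
        fits = all(
            (load[start + i] if start + i < len(load) else 0) + c <= n_pe
            for i, c in enumerate(cv)
        )
        if fits:
            return start
        start += 1


def first_fit(jobs, n_pe):
    """Place jobs in queue order; return (start steps, resulting load)."""
    load, starts = [], {}
    for name, cv in jobs:
        # Simplification: a job wider than the machine is rejected outright.
        if max(cv) > n_pe:
            raise ValueError(f"job {name} needs more than {n_pe} PE's")
        start = earliest_start(load, cv, n_pe)
        load.extend([0] * (start + len(cv) - len(load)))
        for i, c in enumerate(cv):
            load[start + i] += c
        starts[name] = start
    return starts, load


QUEUE = [
    ("C", (4, 4, 4, 2, 1)),
    ("A", (4, 3, 2, 1, 2, 3, 4)),
    ("B", (3, 5, 7, 6, 5, 2)),
]
starts, load = first_fit(QUEUE, n_pe=8)
print(starts, load)
```

With eight PE's this heuristic starts "C" and "A" together, delays "B" to step 6, and needs twelve steps in all. Starting "A" two steps and "B" three steps after "C" would finish the same three jobs in nine steps without exceeding eight PE's, so first-fit only guarantees that the machine is never overloaded, not that the fit is the best one.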

The results of these experimental runs are presented in
the following section in graphical form. Additionally, the
meaning and significance of the results are discussed.


V. RESULTS AND DISCUSSION


In response to the first part of this research's
hypothesis, it is proposed that the Petri net-based
Requester-Server (R-S) methodology is indeed a desirable
tool with which to predict the performance of the diverse
designs of data flow architectures. The ability to
separately specify the hardware, software and resource
allocation policy was an R-S feature which permitted
efficient generation of the combinations of the above three
items. The ability to easily implement variable levels of
detail in both the hardware and software was not exploited
but the method for doing so was introduced. Finally, the R-S
methodology's capability of simulating concurrency and
asynchronous behavior in both the hardware and software is a
necessity for accurately modelling and simulating data flow
computing.

The results which address the second part of this
research's hypothesis are now presented. To begin, figure
V.1 shows individual program execution times as a function
of the number of processing elements (PE's). These absolute
execution times were used as a basis for comparison with the
results from the multiprogramming environment runs. Percent
hardware utilization is displayed adjacent to each data
point. The hardware utilization values are averages of the
hardware utilizations at each time step during execution.
The graph shows that a program's execution time can be
drastically reduced by increasing the number of processing
resources (PE's) available up to the point where execution
time is bounded by the amount of inherent parallelism
available for exploitation in the program.
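
The bound described above can be illustrated with a simplified model. The sketch assumes that every operation takes one 0.5 microsecond time step (the 500 nsec instruction time quoted in section IV.B) and that all operations of one concurrency vector entry must finish before the next entry begins; the actual R-S simulation may behave differently. The function name is hypothetical, and the vector used is that of model program "B".

```python
import math

STEP_USEC = 0.5  # one instruction time step, from the assumed 2 MFLOPS PE rate


def execution_time_usec(cv, n_pe):
    """Run time of one program when each vector entry acts as a barrier."""
    steps = sum(math.ceil(c / n_pe) for c in cv)
    return steps * STEP_USEC


PROGRAM_B = (8, 8, 8, 8)
times = {n: execution_time_usec(PROGRAM_B, n) for n in (1, 2, 4, 8, 16)}
print(times)
```

In this model the run time halves with each doubling of PE's up to eight and then stays constant, because program "B" never offers more than eight operations at once.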

The following results pertain to the running of the
program mixes in a simulated multiprogramming environment.
Initially, it was intended that program mixes would contain
a greater quantity of programs than were actually run. This
was not achieved due to time constraints. Accordingly, the
results should be considered preliminary in nature. On a
positive note, the results provide insight into several data
flow operation issues. Figure V.2 provides the raw data for
this research with the exception of the data utilized in
computing hardware utilization. Figures V.3, V.4, and V.5
present the hardware utilization (as a function of time) for
the 4-, 8-, and 16-PE configurations. Similar graphs for the
1- and 2-PE configurations are presented in figure V.6 (note
the identical nature).

The implications supported by this data follow. They are
not definitive because of the small quantity of programs and
program mixes in the model. Thus, while the second part of
the hypothesis may not have been adequately tested, the
methodology for doing so appears to be available in the R-S
tool. Additionally, it is maintained that the initial
results support the discussion which follows.


                                     sep.    CV   80%   ABT  sep.    CV   80%   ABT
A                               9.5   5.5   9.0  12.5  17.5   3.5   4.5   6.5   9.0
B                              16.0   9.0  17.5  13.0  14.5   4.0   9.0   6.0   7.5
C                              20.0  10.0     -     -     -   6.0     -     -     -
D                               9.0   5.0  11.5  17.5  16.0   3.0   7.0   9.5   8.0
AVG RESPONSE TIME OF 3 PGMS    13.6   6.2  12.7  14.3  16.0   3.5   6.8   7.3   9.2
AVG HW UTIL (%)                 100          99    99    99          96    91    95

                                      8 PE's                 16 PE's
                               sep.    CV   80%   ABT  sep.    CV   80%   ABT
A                               3.5   3.5   3.5   5.0   3.5   3.5   3.5   3.5
B                               2.0   4.5   3.0   4.0   2.0   2.0   2.0   2.0
C                               5.0     -     -     -   5.0     -     -     -
D                               2.5   4.0   5.5   4.5   2.5   3.0   3.0   3.0
AVG RESPONSE TIME OF 3 PGMS     2.7   4.0   4.0   4.5   2.7   2.8   2.8   2.8
AVG HW UTIL (%)                        96    78    86          62    62    62

16 PE's, 4-(MODEL) PROGRAM MIX
                               sep.    CV   80%   ABT
A                               3.5   3.5   3.5   3.5
B                               2.0   3.5   2.0   2.0
C                               5.0   5.0   5.5   6.0
D                               2.5   2.5   4.0   3.5
AVG 4 PGM RESPONSE TIME        3.25   3.6  3.75  3.75
AVG HW UTIL (%)                        68    62    57

FIGURE V.2: EXPERIMENT DATA (ALL VALUES IN µs)


Whenever the amount of concurrency in all running
processes exceeds the number of PE's available to meet the
requirements of the processes, some slowdown in execution
time results for some processes. The dual of this result is
that, so long as there are adequate PE's available, no
slowdown in any process' execution time results.

Under the "All Begin Together" (ABT) assignment policy,
the data flow hardware becomes "overloaded", resulting in
the slowdown just described. For example, program "A",
though the first to begin processing under the ABT scheme,
is the last to finish under the three program mix. Under the
four program mix with 16 PE's, the "C" program takes longer
only because of its great length and inherent parallelism.

Under the "80% Rule" assignment policy, the average
hardware utilization is lower than that under the ABT policy
(for the three program mix). Also, the average of the three
programs' execution times is lower than that under the ABT
policy. For the four program mix, the average hardware
utilization is slightly greater. This reflects a better
mapping of processes onto the machine. (Average hardware
utilization is defined as the average, over the duration of
a run, of the hardware utilization percentages at each time
step of that run.)

Under the optimized concurrency vector (CV) approach,
programs were mapped onto the hardware configurations in
such a way as to achieve a high hardware utilization at each
time step as well as minimize the average response time of
the programs in the mix. The results indicate average
hardware utilizations at least as high as under either of
the other assignment policies. Also, the average response
times were at least as low as under either of the other
assignment policies. This optimized concurrency vector
approach should be suitable for machine optimization. Using
concurrency vectors generated at compile time, the mapping
of additional processes onto the data flow machine should
probably continue only so long as acceptable average
response time for any process is not exceeded. When a
process characterized by more inherent parallelism than can
be currently accommodated on the data flow machine is
awaiting assignment (i.e. mapping onto the data flow
machine), that process' assignment should be delayed until
sufficient (or, if necessary, all) PE's are available to
parallel process the computation. (A user advisory denoting
such a delay would be highly desirable.)
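
The two performance indices used in this discussion can be computed as sketched below. The response times in the example are the separately run 16-PE times for programs "A" through "D" listed in figure V.2; the per-step busy counts are the TOTAL row of figure IV.D.1, used here only as an illustrative load on eight PE's and not as experiment data. The function names are hypothetical.

```python
def average_utilization(busy_per_step, n_pe):
    """Mean, over the time steps of a run, of the percentage of PE's busy."""
    percents = [100.0 * busy / n_pe for busy in busy_per_step]
    return sum(percents) / len(percents)


def average_response_time(times_usec):
    """Mean completion time of the programs in a mix, in microseconds."""
    return sum(times_usec) / len(times_usec)


illustrative_load = [8] * 10 + [7, 8]    # busy PE's at each time step
separate_16pe_usec = [3.5, 2.0, 5.0, 2.5]  # programs A, B, C, D

print(average_utilization(illustrative_load, n_pe=8))
print(average_response_time(separate_16pe_usec))
```

The second value reproduces the 3.25 microsecond average listed for the separately run programs in the four program mix.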

The time spent in preprocessing jobs in accordance with
any assignment scheme to achieve some level of optimization
may be unnecessary or even wasteful. This trade-off will
have to be examined in greater depth.


VI. SUMMARY


Following a review of the pertinent literature, a
two-part hypothesis was proposed. First, the Petri net-based
Requester-Server (R-S) methodology's suitability for
predicting the performance of data flow machines was to be
tested. Second, it was hypothesized that the goal of
economically achieving higher speed computation through data
flow computing would be unattainable without achieving a
high and intelligent degree of multiprogramming. The R-S
methodology, a simulation technique, permits the separate
specification of the hardware to be evaluated, the software
to be used in the hardware evaluation, and the policy for
allocating hardware resources to program requests for
service. Accordingly, Petri net models of data flow hardware
configurations were quantified in terms of their execution
in time, and Petri net models of data flow programs were
quantified in terms of the amount of inherent parallelism
available for exploitation at each discrete time step, as
well as in terms of the implicit data dependencies of the
program. Model programs were "run" on model hardware
configurations. Results obtained from the monitor function
of the R-S program were analyzed with respect to the
performance indices: hardware utilization and response time.

Three assignment policies for determining when to map
additional programs onto a data flow machine were tested:

1. all programs begin together

2. assign an additional program whenever the
hardware utilization drops below 80%

3. assign an additional program based on a
concurrency vector.

Results show that the R-S methodology is indeed an
efficient and easy-to-use tool for investigating data flow
architectures. Also, initial results indicate that optimized
scheduling based upon concurrency vectors is viable for
deciding when to map additional processes onto a data flow
machine to achieve the objectives of maintaining high
hardware utilization and providing acceptable average
response time.


VII.  RECOMMENDATIONS  FOR  FURTHER  RESEARCH 


With regard to the methodology, worthwhile additions to
the R-S program would be user-friendly "front-" and
"back-ends" which would further simplify both the generation
of input files for the R-S tool and the retrieval of desired
data from the output file generated by each run.

In the area of data flow, simulations in which the
hardware definition of the architecture was varied (as
described in section IV.B) could provide insights regarding
the optimal hardware configuration for the expected program
load. In particular, the quantity of (modelled) PE's should
be increased to a number closer to the amount expected in
the actual machine (approximately 512). Of course, in order
to model more accurately, the expected load in terms of the
quantity of programs and typical amounts of inherent
parallelism shall have to be defined more exactly.

A final open area within data flow research is the
development and testing of specific algorithms using
concurrency vectors to permit machine optimization and
implementation of a desirable assignment policy.